TreeQN and ATreeC: Differentiable Tree-Structured Models for Deep Reinforcement Learning
Combining deep model-free reinforcement learning with on-line planning is a
promising approach to building on the successes of deep RL. On-line planning
with look-ahead trees has proven successful in environments where transition
models are known a priori. However, in complex environments where transition
models need to be learned from data, the deficiencies of learned models have
limited their utility for planning. To address these challenges, we propose
TreeQN, a differentiable, recursive, tree-structured model that serves as a
drop-in replacement for any value function network in deep RL with discrete
actions. TreeQN dynamically constructs a tree by recursively applying a
transition model in a learned abstract state space and then aggregating
predicted rewards and state-values using a tree backup to estimate Q-values. We
also propose ATreeC, an actor-critic variant that augments TreeQN with a
softmax layer to form a stochastic policy network. Both approaches are trained
end-to-end, such that the learned model is optimised for its actual use in the
tree. We show that TreeQN and ATreeC outperform n-step DQN and A2C on a
box-pushing task, as well as n-step DQN and value prediction networks (Oh et
al. 2017) on multiple Atari games. Furthermore, we present ablation studies
that demonstrate the effect of different auxiliary losses on learning
transition models.
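
To make the recursive backup concrete, here is a minimal PyTorch sketch: a learned per-action transition is applied in a latent space, and predicted rewards and leaf values are aggregated bottom-up into Q-estimates. The layer sizes, the shared ReLU transition, and the hard max over child Q-values are illustrative assumptions; the published architecture, and its softer mixing of values in the backup, differ in detail.

```python
# Minimal sketch of a TreeQN-style backup (illustrative shapes and names,
# not the authors' code).
import torch
import torch.nn as nn

class TreeQNSketch(nn.Module):
    def __init__(self, obs_dim, hidden, n_actions, depth=2, gamma=0.99):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.transition = nn.Linear(hidden, hidden * n_actions)  # one branch per action
        self.reward = nn.Linear(hidden, n_actions)
        self.value = nn.Linear(hidden, 1)
        self.n_actions, self.depth, self.gamma = n_actions, depth, gamma

    def backup(self, z, d):
        r = self.reward(z)                                        # (B, A)
        child = torch.relu(self.transition(z)).view(-1, self.n_actions, z.shape[-1])
        if d == 1:
            v = self.value(child).squeeze(-1)                     # leaf values, (B, A)
        else:  # tree backup: here a child's value is the best Q in its subtree
            q_child = self.backup(child.reshape(-1, z.shape[-1]), d - 1)
            v = q_child.max(-1).values.view(-1, self.n_actions)
        return r + self.gamma * v                                 # Q-estimates, (B, A)

    def forward(self, obs):
        return self.backup(self.encode(obs), self.depth)

q = TreeQNSketch(obs_dim=8, hidden=32, n_actions=4)(torch.randn(5, 8))
print(q.shape)  # torch.Size([5, 4]) -- a drop-in for a Q-network's output
```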
Deep Variational Reinforcement Learning for POMDPs
Many real-world sequential decision making problems are partially observable
by nature, and the environment model is typically unknown. Consequently, there
is great need for reinforcement learning methods that can tackle such problems
given only a stream of incomplete and noisy observations. In this paper, we
propose deep variational reinforcement learning (DVRL), which introduces an
inductive bias that allows an agent to learn a generative model of the
environment and perform inference in that model to effectively aggregate the
available information. We develop an n-step approximation to the evidence lower
bound (ELBO), allowing the model to be trained jointly with the policy. This
ensures that the latent state representation is suitable for the control task.
In experiments on Mountain Hike and flickering Atari, we show that our method
outperforms previous approaches that rely on recurrent neural networks to encode
the past.
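
A hedged sketch of the joint objective follows: a recurrent latent-state model contributes a per-step ELBO term, summed over an n-step window and optimised together with the RL loss. DVRL proper maintains a particle-filter belief; this sketch collapses it to a single reparameterised latent, and all module names and sizes are assumptions.

```python
# Sketch of jointly training a latent-state model (via an n-step ELBO) with
# a policy loss; the particle-filter machinery of DVRL is omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentBeliefSketch(nn.Module):
    def __init__(self, obs_dim, z_dim):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim + z_dim, z_dim)
        self.enc = nn.Linear(z_dim + obs_dim, 2 * z_dim)  # q(z_t | h_{t-1}, o_t)
        self.dec = nn.Linear(z_dim, obs_dim)              # p(o_t | z_t)

    def step(self, h, obs):
        mu, logvar = self.enc(torch.cat([h, obs], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterised
        recon = self.dec(z)
        # Per-step ELBO: reconstruction term minus KL(q || N(0, I)).
        elbo = (-F.mse_loss(recon, obs, reduction="none").sum(-1)
                - 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1))
        return self.rnn(torch.cat([obs, z], -1), h), elbo

model = LatentBeliefSketch(obs_dim=6, z_dim=16)
h, loss = torch.zeros(1, 16), 0.0
for obs in torch.randn(5, 1, 6):      # an n-step (here 5-step) observation window
    h, elbo = model.step(h, obs)
    loss = loss - elbo.mean()         # maximise the ELBO jointly with the RL loss
loss.backward()
```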
Auto-Encoding Sequential Monte Carlo
We build on auto-encoding sequential Monte Carlo (AESMC): a method for model
and proposal learning based on maximizing the lower bound to the log marginal
likelihood in a broad family of structured probabilistic models. Our approach
relies on the efficiency of sequential Monte Carlo (SMC) for performing
inference in structured probabilistic models and the flexibility of deep neural
networks to model complex conditional probability distributions. We develop
additional theoretical insights and introduce a new training procedure which
improves both model and proposal learning. We demonstrate that our approach
provides a fast, easy-to-implement and scalable means for simultaneous model
learning and proposal adaptation in deep generative models.
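
The core objective can be illustrated on a toy linear-Gaussian state-space model: the log of the SMC estimate of the marginal likelihood lower-bounds log p(x), and ascending it with respect to both model and proposal parameters trains the two jointly. The parameterisation below (theta, phi) is a deliberately simple stand-in, not the paper's setup, and gradients through the resampling indices are simply ignored.

```python
# Toy AESMC-style objective: maximise the log of the SMC marginal-likelihood
# estimate with respect to model (theta) and proposal (phi) parameters.
import torch
import torch.distributions as D

K, T = 8, 10                                     # particles, time steps
theta = torch.tensor(0.9, requires_grad=True)    # transition parameter
phi = torch.tensor(0.5, requires_grad=True)      # proposal scale
x = torch.randn(T)                               # observed sequence (toy data)

z, log_Z = torch.zeros(K), 0.0                   # log_Z accumulates log p_hat(x)
for t in range(T):
    prop = D.Normal(theta * z, phi.abs() + 1e-3)          # proposal q(z_t | z_{t-1})
    z_new = prop.rsample()
    logw = (D.Normal(theta * z, 1.0).log_prob(z_new)      # transition density
            + D.Normal(z_new, 1.0).log_prob(x[t])         # likelihood
            - prop.log_prob(z_new))
    log_Z = log_Z + torch.logsumexp(logw, 0) - torch.log(torch.tensor(float(K)))
    idx = D.Categorical(logits=logw).sample((K,))         # resampling (gradients
    z = z_new[idx]                                        # through indices ignored)
(-log_Z).backward()                              # ascend the SMC lower bound
```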
My Body is a Cage: the Role of Morphology in Graph-Based Incompatible Control
Multitask Reinforcement Learning is a promising way to obtain models with
better performance, generalisation, data efficiency, and robustness. Most
existing work is limited to compatible settings, where the state and action
space dimensions are the same across tasks. Graph Neural Networks (GNNs) are one
way to address incompatible environments, because they can process graphs of
arbitrary size. They also allow practitioners to inject biases encoded in the
structure of the input graph. Existing work in graph-based continuous control
uses the physical morphology of the agent to construct the input graph, i.e.,
encoding limb features as node labels and using edges to connect the nodes if
their corresponding limbs are physically connected. In this work, we present a
series of ablations on existing methods that show that morphological
information encoded in the graph does not improve their performance. Motivated
by the hypothesis that any benefits GNNs extract from the graph structure are
outweighed by difficulties they create for message passing, we also propose
Amorpheus, a transformer-based approach. Further results show that, while
Amorpheus ignores the morphological information that GNNs encode, it
nonetheless substantially outperforms GNN-based methods.
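
A minimal sketch of this idea: each limb becomes a token, a standard transformer encoder layer attends over all limb pairs with no morphology edges, and a shared head emits one action per limb, so the same network handles agents with different numbers of limbs. The dimensions and the single encoder layer are illustrative assumptions, not the paper's exact architecture.

```python
# Limbs as tokens, full self-attention instead of a morphology graph.
import torch
import torch.nn as nn

n_limbs, limb_feat, d_model = 7, 11, 64
embed = nn.Linear(limb_feat, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
act_head = nn.Linear(d_model, 1)                 # one torque per limb token

limbs = torch.randn(1, n_limbs, limb_feat)       # per-limb observations
actions = act_head(layer(embed(limbs))).squeeze(-1)
print(actions.shape)                             # (1, n_limbs): size-agnostic output
```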
Exploration in Approximate Hyper-State Space for Meta Reinforcement Learning
To rapidly learn a new task, it is often essential for agents to explore
efficiently -- especially when performance matters from the first timestep. One
way to learn such behaviour is via meta-learning. However, many existing methods
rely on dense rewards for meta-training, and can fail catastrophically if the
rewards are sparse. Without a suitable reward signal, the need for exploration
during meta-training is exacerbated. To address this, we propose HyperX, which
uses novel reward bonuses for meta-training to explore in approximate
hyper-state space (where hyper-states represent the environment state and the
agent's task belief). We show empirically that HyperX meta-learns better
task-exploration and adapts more successfully to new tasks than existing
methods. Published at the International Conference on Machine Learning (ICML)
2021.
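
The following sketch shows the general shape of a hyper-state exploration bonus: concatenate the environment state with a summary of the task belief and reward novelty of the pair. The count-based bonus used here is a simple stand-in; HyperX's actual bonuses are defined differently, and all names are hypothetical.

```python
# Stand-in sketch of a novelty bonus over approximate hyper-states.
import torch
from collections import Counter

counts = Counter()

def hyper_state_bonus(state, belief, scale=0.1, bins=10):
    """Discretise the (state, belief) pair and pay a count-based bonus."""
    hyper = torch.cat([state, belief], dim=-1)   # the approximate hyper-state
    key = tuple((hyper * bins).long().tolist())
    counts[key] += 1
    return scale / counts[key] ** 0.5

state = torch.tensor([0.2, -0.4])    # environment state
belief = torch.tensor([0.7, 0.1])    # summary of the agent's task belief
r = 0.0 + hyper_state_bonus(state, belief)   # added to the (possibly sparse)
print(r)                                     # environment reward at meta-training
```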
VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning
Trading off exploration and exploitation in an unknown environment is key to
maximising expected return during learning. A Bayes-optimal policy, which does
so optimally, conditions its actions not only on the environment state but on
the agent's uncertainty about the environment. Computing a Bayes-optimal policy
is however intractable for all but the smallest tasks. In this paper, we
introduce variational Bayes-Adaptive Deep RL (variBAD), a way to meta-learn to
perform approximate inference in an unknown environment, and incorporate task
uncertainty directly during action selection. In a grid-world domain, we
illustrate how variBAD performs structured online exploration as a function of
task uncertainty. We further evaluate variBAD on MuJoCo domains widely used in
meta-RL and show that it achieves higher online return than existing methods. Published at ICLR 2020.
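
A minimal variBAD-flavoured sketch: a recurrent encoder maps the trajectory so far to an approximate posterior over a task embedding, and the policy conditions on the current state together with that belief (here its mean and log-variance). Sizes and heads are illustrative assumptions rather than the published architecture.

```python
# Belief-conditioned acting: encode the trajectory into a task posterior,
# then feed state plus belief to the policy.
import torch
import torch.nn as nn

obs_dim, act_dim, task_dim = 4, 2, 5
encoder = nn.GRU(obs_dim + act_dim + 1, 32, batch_first=True)  # (s, a, r) stream
to_belief = nn.Linear(32, 2 * task_dim)                        # mean and log-var
policy = nn.Linear(obs_dim + 2 * task_dim, act_dim)            # pi(a | s, belief)

traj = torch.randn(1, 6, obs_dim + act_dim + 1)  # first 6 transitions of an episode
_, h = encoder(traj)
mu, logvar = to_belief(h[-1]).chunk(2, -1)       # q(m | tau_{:t}): the task belief
obs = torch.randn(1, obs_dim)
scores = policy(torch.cat([obs, mu, logvar], -1))  # belief-conditioned action scores
print(scores.shape)
```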
Hierarchical Imitation Learning for Stochastic Environments
Many applications of imitation learning require the agent to generate the
full distribution of behaviour observed in the training data. For example, to
evaluate the safety of autonomous vehicles in simulation, accurate and diverse
behaviour models of other road users are paramount. Existing methods that
improve this distributional realism typically rely on hierarchical policies.
These condition the policy on types such as goals or personas that give rise to
multi-modal behaviour. However, such methods are often inappropriate for
stochastic environments where the agent must also react to external factors.
Because agent types are inferred from the observed future trajectory during
training, these environments require that the contributions of internal and
external factors to the agent's behaviour be disentangled, with only internal
factors, i.e., those under the agent's control, encoded in the type.
Encoding future information about external factors leads to inappropriate agent
reactions during testing, when the future is unknown and types must be drawn
independently from the actual future. We formalize this challenge as
distribution shift in the conditional distribution of agent types under
environmental stochasticity. We propose Robust Type Conditioning (RTC), which
eliminates this shift with adversarial training under randomly sampled types.
Experiments on two domains, including the large-scale Waymo Open Motion
Dataset, show improved distributional realism while maintaining or improving
task performance compared to state-of-the-art baselines. Published at IROS'23.
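
A sketch of the idea behind Robust Type Conditioning: during training the policy is sometimes conditioned on types drawn from the prior rather than inferred from the (privileged) future, and an adversarial critic pushes behaviour under prior-sampled types toward realism, removing the train/test shift. All modules and the 50/50 sampling split below are illustrative assumptions, not the paper's implementation.

```python
# Randomly swap in prior-sampled types during training; score realism
# adversarially so prior types cause no shift at test time.
import torch
import torch.nn as nn

obs_dim, type_dim, act_dim = 8, 4, 2
infer_type = nn.Linear(obs_dim, type_dim)        # type from the (privileged) future
policy = nn.Linear(obs_dim + type_dim, act_dim)
critic = nn.Linear(obs_dim + act_dim, 1)         # adversarial realism score

obs, future = torch.randn(16, obs_dim), torch.randn(16, obs_dim)
use_prior = torch.rand(16, 1) < 0.5              # half the batch gets prior types
z = torch.where(use_prior, torch.randn(16, type_dim), infer_type(future))
act = policy(torch.cat([obs, z], -1))
g_loss = -critic(torch.cat([obs, act], -1)).mean()  # fool the critic so behaviour
g_loss.backward()                                   # stays realistic under prior types
```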